Skip to content

fix a memleak, close the patcher component; #13322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wjjahah
Copy link
Contributor

@wjjahah wjjahah commented Jul 3, 2025

patch_list has been destroy, but component not release

Signed-off-by: wjjahah <2457791952@qq.com>
@wjjahah wjjahah mentioned this pull request Jul 3, 2025
@jsquyres jsquyres requested review from bosilca, hjelmn and bwbarrett July 4, 2025 11:59
Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in the original PR, I am not entirely sure it is always safe to unload the patcher components, especially the active one, from memory. Second, I don't think we test this scenario, as I don't know what BTL/MTL that are still actively tested still requires the patcher (UCX does not).

@wjjahah
Copy link
Contributor Author

wjjahah commented Jul 7, 2025

As I said in the original PR, I am not entirely sure it is always safe to unload the patcher components, especially the active one, from memory. Second, I don't think we test this scenario, as I don't know what BTL/MTL that are still actively tested still requires the patcher (UCX does not).

In my understanding, the opal_patcher_base_close function is the entry point to close the patcher. Since this is the entry point, the caller should ensure that it is safe. In addition, the patcher resources have been released at the beginning of this function, but the component has not been released, which may be a previous omission.

@bosilca
Copy link
Member

bosilca commented Jul 7, 2025

For any other component that would be the normal behavior, but the memory patcher is different because it tracks memory management and as such is deeply integrated with the libc. Some techniques (aka most hook registrations) are seamless and can be added/removed as needed with litle impact, but some others (like the GOT table patching) are more fragile, and I think that once installed are better left active for the rest of the execution. This means that the module that installed them should never be unloaded, but that is impossible once you call the mca_base_framework_components_close (and brute force unloading them).

@wjjahah
Copy link
Contributor Author

wjjahah commented Jul 7, 2025

For any other component that would be the normal behavior, but the memory patcher is different because it tracks memory management and as such is deeply integrated with the libc. Some techniques (aka most hook registrations) are seamless and can be added/removed as needed with litle impact, but some others (like the GOT table patching) are more fragile, and I think that once installed are better left active for the rest of the execution. This means that the module that installed them should never be unloaded, but that is impossible once you call the mca_base_framework_components_close (and brute force unloading them).

I traced the code and found that mca_base_framework_components_close will only unregister the registered variables and release the component itself. If you need to do other actions, you need to define the component's own close function. If some operations of a component need to be retained, is it more reasonable to define a component's own close function?

@bwbarrett
Copy link
Member

The code is unnecessarily convoluted, but that isn't entirely true. mca_base_framework_compoents_close() calls opal_patcher_base_restore_all(), which loops around restoring the patches.

I think there's some minor risk with this patch. Given how many of the memory hooks we've removed over the years as patcher has gotten more trusted. Note that Libfabric has basically the same set of patcher code and does unload itself. Because of how NCCL + the OFI NCCL plugin works, we're unloaded the plugin during dlclose() in Libfabric every end of run.

I think we should merge this and watch MTT for a couple days, but I have no real objection to the patch.

@bosilca
Copy link
Member

bosilca commented Jul 7, 2025

Is the patcher tested with our MTT ? UCX does not use it anymore, and most of the BTLs don't need it either. So, we're basically left with libfabric with or without CUDA. How much is that codepath stressed ?

@hjelmn
Copy link
Member

hjelmn commented Jul 7, 2025

This LGTM. I agree with @bwbarrett that this should be merged and MTT watched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants